From: "World Chess Championship", INTERNET:newsletter@mark-weeks.com
Date: 00/12/16, 11:36
Re: Chess History on the Web (2000 no.24)

In case you're not following the FIDE World Chess Championship in New Delhi, round 6 finished after the four standard games were played; no playoff games were necessary. Anand beat Adams 2.5 - 1.5, while Shirov beat Grischuk by the same score. The six game final round between Anand and Shirov starts 20 December in Tehran. The Women's Championship was an all-Chinese final in which Xie Jun beat Qin Kanying 2.5 - 1.5 to retain the title.

---

Site review - UPITT (III)

In Chess History on the Web nos. 20 & 21, I looked at the University of Pittsburgh (UPITT) archive at...

   http://www.pitt.edu/~schach/

...by examining the PGN (Portable Game Notation) game collections for Kasparov and Kramnik. I remarked that the UPITT Kasparov collection contains many duplicate games, while the Kramnik collection is almost five years out of date.

Rather than concentrate on individual files, I'd now like to look at the archive as a whole. How big is it? How is it structured? What sort of game data is available? How complete is it? To do this, we'll be looking at some numbers and statistics.

---

The PGN subdirectory is one of the main branches in UPITT's /group/student-activities/chess/ directory. It contains an index file and five subdirectories -- /Players, /Events, /Openings, /Collections, & /Demo. The index file (00INDEX.PG) is a text file which lists and describes all files in the PGN subdirectories.

Those subdirectories contain PGN files in ZIP format, categorized by type, where the subdirectory name documents the type of files it contains. The /Players subdirectory, for example, contains collections of games by individual players. The /Events subdirectory, which contains files for tournaments, is further classified by year, with a subdirectory for each year 1992-2000 plus a subdirectory /0-1991 covering events held before 1992.
The PGN subdirectory is parallel to other main branches which contain games in popular digital formats like Chess Assistant and ChessBase. These are all located under the /Chess directory, which has its own index file ALLINDEX.TXT, covering all of the /Chess branches. The PGN section of ALLINDEX.TXT matches 00INDEX.PG perfectly, so it is apparently built by concatenating the index files across all of the various subdirectories.

To look at the PGN data in more detail, I captured the listings for all of the PGN subdirectories and loaded them into MS Access. I also captured the /Players subdirectory for the other digital formats, to allow comparisons between the PGN collection and other collections.

The following table shows the number of files in each subdirectory and the total size of those files in megabytes.

   Files    Size   Subdirectory
      32    9.2M   PGN/Collections
       3    0.1M   PGN/Demo
    1275   34.0M   PGN/Events
     148   49.3M   PGN/Openings
     216   32.6M   PGN/Players
     234   21.7M   CA/Players
     236   27.8M   CB/Players
       1    0.1M   CB6/Players
     165    9.4M   NICB/Players

The following table shows when the 1674 PGN files were loaded into the UPITT archive.

   Year   Files
   1995     167
   1996     731
   1997     431
   1998     210
   1999      82
   2000      53

The following table shows how the 1275 PGN/Events files are distributed across the years in which the events were played.

   Year   Files
   1991     384   (0-1991)
   1992      83
   1993      51
   1994     131
   1995     219
   1996     180
   1997      90
   1998      95
   1999      34
   2000       8

These last two tables show that the UPITT chess archive activity peaked in 1996 and has been declining ever since. The earliest PGN files are date stamped 1995-03-27 and the latest are stamped 2000-11-18. The archive activity has not stopped; it has just slowed down.

I checked that the files listed in ALLINDEX.TXT matched the names shown in the PGN/Players subdirectory listing and found that five file names were misspelled. There were other discrepancies between the index and the file lists, but I didn't resolve them.
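The cross-checks described above (table totals, and index names against a directory listing) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MS Access routine actually used. The counts are copied from the tables above; the function name `compare_names` and the sample file names are invented for the example.

```python
# Counts copied from the tables above.
pgn_files = {
    "PGN/Collections": 32,
    "PGN/Demo": 3,
    "PGN/Events": 1275,
    "PGN/Openings": 148,
    "PGN/Players": 216,
}
loaded_by_year = {1995: 167, 1996: 731, 1997: 431,
                  1998: 210, 1999: 82, 2000: 53}
events_by_year = {"0-1991": 384, 1992: 83, 1993: 51, 1994: 131, 1995: 219,
                  1996: 180, 1997: 90, 1998: 95, 1999: 34, 2000: 8}

# The three tables should agree with each other.
total_pgn = sum(pgn_files.values())
total_loaded = sum(loaded_by_year.values())
total_events = sum(events_by_year.values())


def compare_names(index_names, listing_names):
    """Return (only_in_index, only_in_listing), ignoring case and whitespace.

    A misspelled name typically shows up once in each list.
    """
    index_set = {name.strip().upper() for name in index_names}
    listing_set = {name.strip().upper() for name in listing_names}
    return sorted(index_set - listing_set), sorted(listing_set - index_set)


# Invented sample names (NOT real UPITT files), with one spelling mismatch.
sample_index = ["AAA.ZIP", "BBB.ZIP", "CCC.ZIP"]
sample_listing = ["aaa.zip", "bbb.zip", "ccd.zip"]
only_in_index, only_in_listing = compare_names(sample_index, sample_listing)

print(total_pgn, total_loaded, total_events)
print("indexed but not listed:", only_in_index)
print("listed but not indexed:", only_in_listing)
```

Names are compared case-insensitively here because directory listings and index files often differ in case; that choice is an assumption and would hide any discrepancy that is purely one of case.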
The first table above shows that there are more files in the ChessBase CB/Players directory (236 files) than in the PGN/Players directory (216 files). Why is there a difference? Comparing the content of the two directories, I discovered several reasons.

1) The Lasker files are named differently:

   - LASK_PGN.ZIP : 'Emanuel Lasker PGN Coll. in 12 files, from Nick Pope'
   - LASKEREM.ZIP : listed in CB/Players but not in ALLINDEX.TXT

   I didn't check if the files contain the same games.

2) The two PGN/Players SOVGMPG1/-2.ZIP files correspond to the single CB/Players SOVGM-CB.ZIP ['Games of Soviet Giants: Balashov (1835), Glek (895), Kupreichik (1440), Psakhis (1693), Romanishin (1990), Sveshnikov (1507)'; The name 'Soviet Giants' sounds like a sports team, doesn't it?]. The equivalent PGN file was probably so large that its creator felt that it had to be split in two. This underscores a general problem with PGN files. The PGN format is not very efficient for storing chess data, even when compressed.

3) Most of the differences between the PGN & CB Players files are caused by large collections covering multiple grandmasters, which are in the CB directory, but not in the PGN directory. There are, for example, seven CB files named SOVGM2 through SOVGM8 which have no corresponding PGN file.

4) I also found another seven CB files for individual grandmasters which have no PGN equivalent. Beliavsky & Larsen are two examples. Since UPITT has utilities to convert from ChessBase format to PGN format, it is easy to create PGN files from CB files.

How many players are covered in total? Some of the multiple grandmaster collections duplicate collections for single players, while Carlos Torre has two different PGN files. After eliminating these duplicates, my database routines counted 275 different players covered by UPITT collections.

---

Who's missing? This is one of those questions that's easy to ask, but not so easy to answer.
Needing an objective list of great players, I turned to two sources. The first was an offline resource, 'The Rating of Chessplayers, Past and Present' by Arpad E. Elo, B.T.Batsford, London, 1978. Appendix 9.4 of Elo's book is an 'All-Time List of FIDE Titleholders'; Appendix 9.5 is an 'All-Time List of Great Untitled Players'.

I scanned the two appendices, fed the images into character recognition software, and created two tables equivalent to Elo's data, one table for each appendix. If you've ever used OCR techniques, you probably know that their output is time-consuming to correct and the results are error-prone. It still beats retyping everything. I did my best to ensure that my copy of Elo's data was clean, but there is always a possibility that errors are present.

The 'All-Time List of FIDE Titleholders' has 590 players along with their 'best 5-year average' (270 players) and/or their FIDE rating as of 1978-01-01 (434 players). Of these, 118 players are listed with both and 4 players are listed with neither. Those last four are women for whom Elo probably did not have enough samples in his data. The 'best 5-year average' is a rating which Elo calculated based on available competitive data. Note that 74 of the 590 players were deceased at the time that the list was compiled.

I wanted one rating for each player. In a footnote to the table, Elo remarked, 'A best five-year average rating is not shown where data are insufficient or where it is lower than the 1-1-78 rating.' To continue, I used (1) the best 5-year average for the 270 players listed with an average, and (2) the 1-1-78 rating for the 316 players with no best 5-year average. The following table shows how these ratings are distributed.

   ELO    Count
   27xx       4
   26xx      32
   25xx     133
   24xx     331
   23xx      80
   22xx       6

While we're discussing this list of FIDE titled players, let's digress and look at some statistics related to the title. A count of titles for the 590 players yields 178 GMs & 412 IMs.
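As an aside, the one-rating-per-player rule and the 'xx' banding used for the table above can be sketched as follows. This is a minimal illustration in Python, not the database routine actually used. The function names and the sample records ('Player A' and so on, with made-up ratings) are assumptions; only the counts in the first few lines come from Elo's list as quoted above.

```python
from collections import Counter

# Counts quoted above for the 'All-Time List of FIDE Titleholders'.
with_best5, with_1978, with_both, with_neither = 270, 434, 118, 4

# Inclusion-exclusion: players with at least one rating, plus those with none.
total_players = with_best5 + with_1978 - with_both + with_neither
only_1978 = with_1978 - with_both  # players who need the 1-1-78 fallback


def chosen_rating(best5, rating_1978):
    """Prefer the best 5-year average; fall back to the 1-1-78 rating.

    Either argument may be None when Elo's list shows no value.
    """
    return best5 if best5 is not None else rating_1978


def band(rating):
    """Map a rating such as 2780 to its hundred band, '27xx'."""
    return f"{rating // 100}xx"


def distribution(players):
    """Count players per band, skipping players with no rating at all."""
    counts = Counter()
    for _name, best5, rating_1978 in players:
        rating = chosen_rating(best5, rating_1978)
        if rating is not None:
            counts[band(rating)] += 1
    return counts


# Hypothetical records: (name, best 5-year average, 1-1-78 rating).
sample_players = [
    ("Player A", 2610, 2580),
    ("Player B", None, 2450),
    ("Player C", 2490, None),
    ("Player D", None, None),  # listed with neither rating
]

print(total_players, only_1978)
print(distribution(sample_players))
```

Because Elo omits a best 5-year average whenever it is lower than the 1-1-78 rating, preferring the average whenever it is present amounts to taking the higher of the two values.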
Elo's data also shows in what year these titles were granted, so it's possible to count how many titles were granted each year. A problem here is that most GMs first earn the IM title, but Elo listed only the year that the GM title was earned. This means that any statistics for IMs are really for 'IMs who hadn't yet earned a GM title', which skews the results for the IMs.

The first FIDE titles were granted in 1950 -- Elo lists 27 GM & 72 IM titles. Of these players, 22 were also on the 1978 rating list. The 27 GMs had best 5-year average ratings ranging from 2720 (Botvinnik) to 2490 (Mieses & Sämisch), where the mean was 2605 and the median was 2610. The 71 IMs (Ludmila Rudenko is excluded because she is not listed with a best 5-year average) had best 5-year average ratings ranging from 2540 (Atkins & Mikenas) to 2380 (Wade) [mean=2468; median=2470]. There were undoubtedly other IMs named in 1950 who later became GMs.

It appears that some politics were involved in the initial granting of titles. Bogolyubov was not granted the GM title until 1951, although he played Alekhine for the world champion title in 1929 & 1934, and Elo calculated a best 5-year average of 2610.

Elo listed the year of birth for all but six players, so we can calculate some statistics based on their ages. This is not completely accurate. A player born on 1 January gives the same result as a player born on 31 December of the same year. The same problem applies to the date on which a title was earned or granted. It's also not clear which year Elo used -- earned or granted. Age calculations may be off by a year or two, so none of the following remarks should be taken too seriously.

The oldest player to be granted the GM title in 1950 was Jacques Mieses (b.1865); the youngest was David Bronstein (b.1924). The oldest player to be granted the IM title in 1950 was Henry Atkins (b.1872); the youngest was Andreas Dückstein (b.1927). The average age of the GMs was 49, and of the IMs, 44.
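To make the uncertainty in these age calculations concrete, here is a small Python sketch. It covers only the birthday effect (a title date falling before or after the player's birthday in the title year) and assumes the title year itself is exact, which, as noted above, may not be the case. The function name `age_range` is invented for the example; the birth years and the 1950 title year are those quoted above.

```python
def age_range(birth_year, title_year):
    """Return (youngest, oldest) possible age in whole years at the title date.

    With only the two years known, the player had either already had a
    birthday in the title year (age = difference) or not yet (one less).
    """
    difference = title_year - birth_year
    return difference - 1, difference


# Players granted titles in 1950, with birth years as quoted above.
titled_1950 = {
    "Mieses": 1865,
    "Bronstein": 1924,
    "Atkins": 1872,
    "Dückstein": 1927,
}

for name, born in titled_1950.items():
    low, high = age_range(born, 1950)
    print(f"{name}: {low} or {high}")
```

An average taken over simple year differences therefore overstates the true average age by up to one year, before any doubt about the title year is considered.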
It's possible that Bronstein and Dückstein earned their titles competitively; Elo's data doesn't differentiate these from historical titles. I don't believe that FIDE has ever stopped awarding titles based on historical results rather than on current results.

Let's look at the 491 players granted titles after 1950. In 1954, the 5 new GMs had an average age of 40, while the 8 IMs averaged 38. In 1955, the 5 new GMs had an average age of 22, while the 5 IMs averaged 33. I'll use 1955 as a starting year to calculate ages at which a title was earned. This reduces the set from 485 (491 minus 6) to 424 players.

The youngest player to receive the GM title was Fischer, 15 years old in 1958. He is followed by Spassky (18/1955) and Karpov (19/1970). The youngest player to receive the IM title was Mehrshad Sharif of Iran, 13 years old in 1965. Don't forget that all these records are as of 1978, and have since been superseded.

The oldest player to receive the GM title was Esteban Canal of Italy, who was 81 in 1977. This looks like an honorary title, as the three next oldest players on the list also received the GM title in 1977 -- Borislav Milic (age 52 in 1977), Julio Bolbochan (57), and Carlos Torre (73). The oldest player to receive the IM title was Edward Lasker of the United States, who was 78 in 1963.

Across all years 1955 to 1977, the 133 GMs had an average age of 30.4 in the year in which they earned the title; the 291 IMs averaged 31.0. Again, I don't want to mislead anyone. These numbers are only a rough estimate. If we could eliminate the honorary titles, both averages would decrease; if we could add the IMs who later became GMs, the IM average would decrease again.

If we ignore the titles granted in 1950, we see that around 10-20 titles were granted per year. The peak year was 1965, when 28 titles (11 GM and 17 IM) were granted.
In 1973 and 1974, 19 titles were granted each year; in 1975 and 1976, 54 titles were granted each year; in 1977, the last year of Elo's data, 57 titles were granted. Note the big jump from 1974 to 1975. From 1951 to 1974, an average of 14 titles (5 GM and 9 IM) were granted each year. From 1975 to 1977, an average of 55 titles (15 GM and 40 IM) were granted each year.

These trends continued. The FIDE rating list at...

   http://www.fide.com/fide/html/ratings.html

...links to OCT00FRL.ZIP, the 'Download FIDE Rating List for 1 Oct, 2000'. The ZIP file contains a single file ALPHAOCT.TXT, with a size of 2.9M. When I loaded it into MS Access for analysis, I found 35775 players, including 712 GMs and 1995 IMs.

Elo's 'All-Time List of Great Untitled Players' has 197 players with their best 5-year average. These are distributed as follows.

   ELO    Count
   27xx       2
   26xx      11
   25xx      41
   24xx     112
   23xx      31

I combined the untitled players with the titleholders to get a list of 783 players covered by Elo's data (the 197 untitled players plus the 586 titleholders who are listed with a rating). Of these, six had ratings of 2700 or higher.

   2780 Fischer
   2725 Capablanca
   2725 Karpov
   2720 Botvinnik
   2720 Lasker
   2700 Tal

I matched the 49 players with ratings of 2600 or higher against the list of players covered by UPITT game collections. I was pleasantly surprised to find that only one name was missing -- Tassilo von Heydebrand und der Lasa.

Now I had to tackle the same 'Who's missing?' question for players active since 1978. I decided to use the rating list at...

   http://perso.wanadoo.fr/eric.delaire/Palmares.htm

...which is behind the link 'Elo international'. I downloaded the file MONDELO.ZIP and loaded its XLS file into my database.

The file lists 226 players. For each FIDE rating list from January & July, it contains some ratings for some of those players. The rating lists are not complete and there is no obvious reason why some ratings were chosen and others were not.
Although the file contains a few ratings back to January 1971, ratings prior to July 1983 are missing almost completely; there are only one or two per period.

Using all of the ratings for each player in the file, my software calculated the average rating for the 226 players. I selected all players with average ratings greater than 2650 or with a career high greater than 2675. This yielded 34 names. Excluding the four players who were also on Elo's lists, four names broke the 2700 barrier.

   2777 Kasparov
   2731 Kramnik
   2724 Anand
   2706 Ivanchuk

I matched the 30 remaining names to the UPITT collections. Five names having no collection on UPITT popped out -- Epichine, Fedorov, Georgiev Kir., Gurevitch M., and Kasimdzhanov. That is impressive coverage by the UPITT collection.

I had also planned to look at the PGN/Events files, but my time ran out. I'll try to tackle this another time.

Best wishes for the holidays,
Mark Weeks